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Abstract 

Bayesian graphical modeling provides an appealing way to obtain uncertainty esti- 
mates when inferring network structures, and much recent progress has been made for 
Gaussian models. These models have been used extensively in applications to gene ex- 
pression data, even in cases where there appears to be significant deviations from the 
Gaussian model. For more robust inferences, it is natural to consider extensions to t- 
distribution models. We argue that the classical multivariate ^-distribution, defined using 
a single latent Gamma random variable to rescale a Gaussian random vector, is of little 
use in highly multivariate settings, and propose other, more flexible i-distributions. Using 
an independent Gamma-divisor for each component of the random vector defines what we 
term the alternative t-distribution. The associated model allows one to extract informa- 
tion from highly multivariate data even when most experiments contain outliers for some 
of their measurements. However, the use of this alternative model comes at increased 
computational cost and imposes constraints on the achievable correlation structures, rais- 
ing the need for a compromise between the classical and alternative models. To this end 
we propose the use of Dirichlet processes for adaptive clustering of the latent Gamma- 
scalars, each of which may then divide a group of latent Gaussian variables. Dirichlet 
processes are commonly used to cluster independent observations; here they are used in- 
stead to cluster the dependent components of a single observation. The resulting Dirichlet 
t-distribution interpolates naturally between the two extreme cases of the classical and 
alternative ^-distributions and combines more appealing modeling of the multivariate de- 
pendence structure with favorable computational properties. 

Key Words: Bayesian inference, Dirichlet process, graphical model, Markov chain 
Monte Carlo, t-distribution. 



1 Introduction 



Consider a random vector Y = (Yi, . . . , Y p ) and an undirected graph G = (V, E) with vertex 
set V = {1, . . . ,p}. The Gaussian graphical model given by the graph G assumes that Y 
follows a multivariate normal distribution N P ((J,, £), where fi may be any mean vector in 
R p but S is a positive definite covariance matrix that is constrained such that two variables 
Yj and Yk are conditionally independent given all the remaining variables "Y\(jk} whenever 
{j, k} is not an edge in E. The conditional independence holds if and only if E^ 1 = 0; see 
e.g. |Lauritzen (1996). Hence, the graph of a Gaussian graphical model can be inferred from 
data by inferring the pattern of non-zero entries in the precision matrix £ . 

Different Bayesian methods have been developed for an uncertainty assessment in inference 
of the graph. Giudici and Green (1999), for instance, use a uniform prior on decomposable 
graphs and place a Hyper Inverse Wishart prior on the covariance matrix, which allows for 
exact local computation (Dawid and Lauritzen, 1993). In particular, a closed form marginal 



likelihood permits treatment of high-dimensional datasets (Dobra et al. 2004, Jones et al. 



2005 ) . Exact computations for non-decomposable graphs are much more involved ( Roverato 



2002; Dellaportas et al. 2003 Atay-Kayis and Massam, 2005); for an approximate treatment 



see Lenkoski and Dobra (2011). Other recent literature providing different extensions to the 
basic Gaussian model includes Rajaratnam et al. (2008) and Carvalho and Scott (2009). 

While appealing in many respects, Gaussian methods are susceptible to great impact of 
measurement errors. There is a substantial literature on robustness in Bayesian inference, 
including De Finetti (1961) and West (1984), but little work that directly targets graph- 
ical models. One commonly taken approach replaces multivariate normal by multivariate 
i-distributions (with univariate t- margins). T-distributions yield reasonable models for heavy- 
tailed marginal behavior and have the capacity to maintain good statistical efficiency when 
data are in fact Gaussian. Moreover, convenient closed form conditional distributions are 
available for use in Markov chain Monte Carlo algorithms that exploit the representation of 
i-random variables as ratios involving Gaussian and Gamma random variables. 

The classical multivariate i-distribution can be defined in terms of an unobserved Gaussian 
random vector and a single unobserved Gamma divisor. In the context of this paper, modeling 
assumptions then refer to the latent Gaussian vector. Penalized likelihood techniques for 



graphical model selection with this classical t-distribution were explored in Yuan and Huang 



(2009). Highly multivariate data, however, often have pockets of contamination in many 
observations creating a scenario to which the classical i-distribution is poorly suited. This 
paper addresses this problem by developing methods based on more flexible t-distributions. 
The distribution we term the alternative ^distribution has independent Gamma scalars for 
each component of the latent Gaussian vector. This construction has been used in a frequentist 
treatment of graphical models (Finegold and Drton, 2011), but seems to have received little 
attention otherwise. While better suited to higher-dimensional analysis, the distribution's use 
comes at increased computational cost and imposes constraints on the achievable correlation 
structures; see Fig. 9 in Finegold and Drton (2011). It is thus desirable to achieve a trade-off 
between the two extremes, 'classical' versus 'alternative'. 
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The key new contribution of this paper is an adaptive method to interpolate between the 
classical and the alternative case. Our proposal uses Dirichlet processes to cluster the Gamma 
divisors associated with the components of the latent Gaussian vector. The common Gamma- 
value associated with a cluster is then used to divide all the associated Gaussian variables. The 
Dirichlet process clustering thus pertains to the possibly dependent components of a single 
multivariate observation, as opposed to the common case of clustering different independent 
observations. Compared to the alternative case, the Dirichlet i-proposal alleviates constraints 
on correlations and reduces computational effort when a small number of divisors is sufficient. 
While we are concerned with graphical models, there is no limitation to the use of the Dirichlet 
^-framework in other problems of multivariate statistics. 

The paper is organized as follows. Section [2] lays out the Gaussian setup upon which our 
extensions are based. In Section [3j we describe Bayesian inference of network structures with 
classical and alternative i-distributions. The Dirichlet t-distribution is developed in Section 
[4j Numerical experiments (Section [5]) and an analysis of gene expression data (Section [6]) 
demonstrate that our new framework is computationally tractable and statistically efficient 
across a broad spectrum of data with outliers. A conclusion is given Section [7} 



2 A Bayesian Gaussian Graphical Model 
2.1 Background 

Let IW p (m, <&) denote the Inverse Wishart (IW) distribution with degrees of freedom m > 
p — 1 and a positive definite scale matrix This distribution is supported on the cone of 
p x p positive definite matrices and has density 

1*1? ( 1 



p(S I 77i,*) = ^Wyl^l — exp \ --tr (SIT 1 ) \ , (2.1) 



where the determinant of a matrix A is denoted I A\ and 



T p (a) = ^- 1 )/ 4 []r a 



is the multivariate gamma function with argument a > (p — l)/2. The distribution is the 
conjugate prior for the covariance matrix of a multivariate normal distribution. 

Let G = (V, E) be a decomposable graph with vertex set V = {1, ... ,p}, and suppose 
Ci, 52, C2, . . . , S m , C m is a perfect sequence of the graph's cliques C, and separators Si. Here 
and throughout the paper, we assume familiarity with basic graphical concepts as introduced 
in 



Lauritzen 



(1996). Let M + (G) be the set of positive definite matrices XI with entry = 
whenever {j, k} is not an edge in E. For Gaussian graphical modeling, one needs a constrained 
version of the IW distribution for the covariance matrix X that has support such that the 
precision matrix lies in M + (G). The relevant distribution is known as the Hyper IW 
distribution and we denote it by HIW(G, 5, <&), where 5 > is a degrees of freedom parameter 
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and $ is a positive definite scale matrix. The distribution's density 



EI P(EciCi I $ + \Ci\ - 1, &Cid) 
i=i 



i=2 



is the ratio of products of evaluations of the IW density from (2.1 ). It follows from properties 



of the IW distribution that the normalizing constant for the HIW distribution is: 



in 

n 

i=i 



(6+\Ci\-l)/2 



\Ci\ 



S+\Cj\-l 
2 



n 

i=2 



' Si Si 



(5+|Si|-l)/2 



\Si\ 



S+\Si\-l 
2 



(2.2) 



2.2 Model Specification 

We will treat a particular Bayesian Gaussian graphical model that is similar to models in 



work such as Armstrong et al. (2009). The considered prior on graphs, P(G), is supported 
on the set of decomposable graphs with the probabilities of graphs proportional to 



(2.3) 



The parameter d controls the sparsity of the graph. Conditional on G and hyperparameters 
5 and we let the covariance matrix S follow a HIW(G, 5, <fr) distribution, which has the 
consistency property that the submatrix distribution P^dd I G, 5, «&) is the same for any 



graph G with clique C{ (Dawid and Lauritzen 1993). Larger 5 makes the posterior more 



concentrated around the hyperparameter As in Carvalho and Scott (2009) and Donnet 



and Marin (2010), we choose 5 = 1. We use matrix hyperparameter $ = cl p , a scalar multiple 



of the p x p identity matrix. Larger c leads to larger graphs; see Jones et al. (2005). 

Finally, suppose we observe a sample of n independent A/" p (0, X) random vectors Yi, . . . , Y n 
Let Y be the matrix with the vectors Yi as rows. The joint distribution of (Y, G, S) then 
factors as 

n 

P(Y, G, E | S, = P(G)P(E | G, 5, $) ]J P(Y t | S). 

i=l 

Centering Gaussian data by subtracting off the sample mean and assuming mean [i = is 
standard practice since the distribution theory is essentially unchanged. 

2.3 Metropolis-Hastings Sampler 

We now briefly review the posterior sampling scheme used in prior work such as Armstrong 



et al. (2009). Define the sample covariance matrix 

1 n 



i=i 
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and set <&* = 3> + nS and 8* = 8 + n. Then 



(S | Y,$,*,G) ~ fflW(G,<T,**) 



(2.4) 



and 



P(Y\6,*,G) 



1 h(G,5,&) 
(2n) n P/ 2 ' h(G,5*,&* 



(2.5) 



sec 



Dawid and Lauritzen (1993) and Giudici and Green (1999) 



Let G = (V, E) and G' = (V, E') be two decomposable graphs on V = {1, . . . ,p}. Suppose 
that Ci, 52, C2, • • • , SVn, C m is a perfect sequence of the cliques and separators of G and that 
{j, k} £ E. If G' is equal to G except that the edge {j, k} is removed, then the following three 



properties hold; see e.g. Armstrong et al. (2009). First, the edge {j, k} is in a single clique C q 
of G. Second, we have either j ^ S q or k S q . Third, suppose k S q , and let C qi = C q \{k}, 
C92 = \ {j} an d S'g 2 = C q \ {j, /c}. Then C\, S2, • • • , S'g, C qi , S q2 , C q2 , S q +i, . . . , S* m , C m is a 
perfect ordering of all cliques and separators of G' . Let 82 = 8 + \S t 



?2 



and ^2 



5* + IS, 



'12 



The above three observations can be used to show that the ratio of marginal likelihoods 
P(Y I G)/P{Y j G') is equal to 



h(G, 8, <&)h{G' , 8*, 
h(G,8*,&*)h(G\8,$>) 



Here, e = {j, k} and $ ee | 5 _ 



I ^^66| I 



st+i 



1 (^)r(f 



(2.6) 



^eSq2 (^S , q2S' g 2 



) 2e ; the conditional variances for 



j and are defined similarly. The ratio in (2.6) allows one to create a Markov chain with 



the posterior distribution P(G \ Y) as the stationary distribution by applying a Metropolis- 
Hastings procedure that avoids sampling of S. 

Algorithm 1 (Gaussian). Starting with a decomposable graph Go, repeat the following two 
steps for t = 0,1,2, ... : 

(i) Create a graph G' by randomly picking an edge to delete from Gt or to add to Gt- 

(ii) If G' is decomposable, accept the move Gt+i = G' with probability 



min < 1, 



P(Y I G') 



P{Y I G) 

setting Gt+i = G if the move is rejected or G' is not decomposable. 



(2.7) 



Decomposability of the input Gq can be tested with the Max-Cardinality algorithm (Cowell 



et al. 1999). Given the decomposable graph Gq, the set of all decomposable graphs can be 



traversed following simple rules for edge addition and deletion (Giudici and Green, 1999) 
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3 Bayesian Graphical Models for ^-Distributions 



3.1 Classical and Alternative Multivariate t-Distributions 

The classical multivariate t-distribution t PtU (fi, 1 J r ) in M p has density 

r(^±£)|*|-V2 

where 5 y (fj., = {y — fi) T '&~ 1 (y — /x) and y G M p . The vector /x 6 M p is the mean vector. 
The scale parameter matrix \I/ = (if)jk) is assumed positive definite. For degrees of freedom 
v > 2, the covariance matrix exists and is equal to vj{y — 2) times 1 J r . If X ~ A/^(0,\E r ) 
is a multivariate normal random vector independent of the Gamma-random variable r ~ 



r(i//2, i//2), then Y = fi + X/y/r has a t PiV (fi, ^-distribution (Kotz and Nadarajah, 2004 
Chap. 1). The heavy tails of the distribution are related to small values of the divisor r. 

The classical t-distribution is useful for robust inference when only a handful of the ob- 
servations are contaminated. When the dimension p is large, however, it is not uncommon 
for contamination to affect rather small parts of many observations. To handle such a sit- 
uation better we consider p independent divisors ti,...,t p ~ T{y /2,v/2). Assuming the 
divisors to be also independent of X ~ Ap(0, we create the random vector Y with co- 
ordinates Yj = fij + Xj I yjrj and define the alternative multivariate t-distribution to be the 
joint distribution of Y. In symbols, Y ~ t* (/i, For robustified inference, the alternative 
t-distribution is appealing as it allows different rescaling of the different components of Y. 



3.2 Bayesian Inference With Classical t-Distributions 

Suppose Y\, . . . ,Y n £ M. p are independent random vectors drawn from the classical multi- 
variate t-distribution t U7 p(fi, \&). Let Y be the matrix with the vectors Yj as rows. We are 
interested in the posterior distribution on graphs, P{G \ Y), where the graph G corresponds 
to conditional independence constraints on the latent multivariate normal vectors Xi, that 
is, an off-diagonal entry 6jk of the matrix = is zero unless {j, k} is an edge in G. 

In the normal model we can center the data by subtracting off the sample mean and 
assume, without loss of generality, that fi = 0. For the t-model, robust estimation of both /x 
and * is desirable, and we thus include fi in our setup. Let r = (ti, . . . ,r n ) be the vector 
of unobserved Gamma-divisors for the n observations Yj. Proceeding as in the normal case, 
our full model factors the joint distribution of Y, r, G, fx as 

n 

P(Y,t,G,*,h) = P(G)P(/*)P(* | G,6,&)H P(Yj | Tj,*,/x)P(rj | v ), (3.2) 

8=1 

where 

(Yj | 7j, *) ~ AT p (n, */ri), (*\G,5,®)~HIW(G,6,®), (3.3) 

(ji | i/) ~r(i//2,i//2), /i~AA p (0,a M -X p ). (3.4) 
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In later numerical work, we choose cr^ large enough for the prior to be "flat" over a range of 



plausible values. Throughout the hyperparameters 5, and v are fixed; recall Section 2.2 



Inference for the Gaussian case is simplified by integrating out the covariance matrix 51; 
recall (2.6). In the i-model, we may condition on r and add/remove edges based on the ratio 

P(Y\G',T,fj,) h(G,5,3>)h(G',5*,<S>* T ) 



P(Y\G,r,v) h(G,5*,** T )h(G',5,*Y 



where 



StYy(h) 



1 n 
n ^ 



and 



$ + nS TY Y(p)- 



(3.5) 



(3.6) 



Drawing t given G, n and Y is difficult, however, and we resort to conditioning on further 
parameters and consider the conditional distribution 

'u + p V + £y 4 (/Lt, *) 



( n I y,/i,*)~r 



(3.7) 



compare 



Liu and Rubin 



(1995). This requires = and we use the method of 



Carvalho 



et al. (2007) to draw from 



(*\Y,G,T,tx)~ HIW(G,6*,$* T ) 



(3.8) 



This procedure first draws S&dCn cycles through the cliques of G to draw given ^s^Sn 

and then uses a standard completion algorithm to determine the values of SI/ not associated 
with any clique. We then invert Vl/ to obtain 0. For decomposable graphs, the inversion can 
be done efficiently using the procedure of Dawid and Lauritzen (1993). That is, we compute 

11 , (3.9) 



= ^(* Cl cJ- 1[ 



i=l 



t=2 



where ad) means that we take the p x p matrix of zeros and add in the elements of 
(*&aCi) hi their appropriate places. This calculation only requires the elements ipjk of Vl/ 
corresponding to edges {j, k} £ E. Therefore, for the purposes of obtaining 0, we need not 
perform the completion step in the method of Carvalho et al. (2007). Moreover, every step 
in the generation and inversion of Vl/ is based on local computations at the clique level. 

Now the conditional distribution (/z | Y,t,&,G) is the multivariate normal distribution 



-1 



© 



E r *) + ^ 



(3.10) 



where M = Ip/cr^. To draw fj, using this conditional distribution we must invert the p x p 
matrix ( Ya=i T i)® + ®m> a potentially computationally expensive procedure. For practical 
applications, we thus simply set 



(3.11) 
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instead of drawing from the conditional distribution in (3.10). We provide theoretical and 
numerical justifications for this alternative in the Appendix. 



Algorithm 2 (Classical t). Starting with a decomposable graph Go, and initial values fx and 
To, repeat the following steps for t = 0,1,2, ... : 

(i) Jointly draw a new graph Gt+i and a new matrix ®t+i o,s follows: 



(a) Draw Gt+i as in Algorithm^ but using the ratio in (3.5). 



(b) Conditional on (Y, Gt+i, T t , fit), sample ®t+i by drawing *t?t+i from (3.8) and inverting 



it using (3.9) 



(ii) Conditional on (Y,Gt+i,®t+i), sample the new independent components of the vector 
Tt+l = ■ ■ ■ ,n+i, n ) from fs/fy. 



(Hi) Set fi t <i to the value in (3.11) 



In practice we hope to improve on the estimate of P{G \ Y) that we would obtain from 



the normal model. If we start with a "good" estimate of r as given by the tlasso of Finegold 



and Drton (2011), we may be able to make considerable improvement over the normal model 



without sampling r after every edge draw. In our later simulations, we thus draw r only 
every k > 1 draws of (G, 0), which does not affect the validity of the sampler. Note also that 
the k — 1 intermediate iterations do not involve \I/ (or its inverse 0). 

3.3 Bayesian Inference With Alternative t-Distributions 

For the alternative i-model, there are only a few differences. The model itself is the same as 



in (3.2) except that now Tj is a p- vector, and 



(Yi | Ti, fi) ~ Afp(fJt, diag(l/VT i ) • * • diag(l/^)), 
(nj | v) ~ T(v/2,v/2), j = l,...,p, 

with Tn, . . . ,Ti P independent given v. For the Gibbs sampler, we cannot draw the matrix 
t = (rij) given Y, 0, G, fj, directly, but we can draw given Tjw, Y, 0, G, fi. The conditional 
density (derived in the Appendix) is 



/(r, 



ij I T i\j 



, Y) = C(a, /3, 7 ) • r"" 1 exp {- nj P - ^7} 



(3.12) 



with 



a 



v + 1 



7 = (Xij - fi j )<r) j jXi r 



2 ' 2 
and normalizing constant C(a, (5, 7). The problem of sampling from this density also arose in 



the work of Finegold and Drton (2011 ), where the density appears in the context of a Markov 



chain Monte Carlo EM algorithm. We employ a rejection sampling scheme from Liu et al. 



(2012). This leads to the following Gibbs sampler. 



S 



Algorithm 3 (Alternative t). Starting with a decomposable graph Go, and initial values fj, 
and To, repeat the following steps for t = 0,1,2, ... : 

(i) Jointly draw a new graph Gt+i and a new matrix ®t+i as in Algorithm^ 

(ii) For each observation i = 1, . .. ,n, cycle through the variables j = 1, . . . ,p and draw Tij 



from its current full conditional in (3.12) to obtain a new matrix Tt+i 



(Hi) Set fi t+ i to the value in (3.11), where t is now a vector and the multiplications and 



divisions are done component-wise. 

This sampling scheme for the alternative model works well for moderate p (p ~ 100) and 
underlies our later simulations. The scheme becomes very computationally intensive, however, 
for large p, both in terms of the time to complete one iteration of step (ii) above and the 
number of iterations required to approach convergence of the Markov chain. It is conceivable 
that other strategies, such as using a Metropolis-Hastings step to sample from P(t\G, Y) 
directly, might perform better. However, we will not treat such alternative sampling schemes 
in the remainder of this paper, which is instead devoted to other models. 

4 Dirichlet ^-Models 

We are faced with a trade-off between the classical and alternative models. If our goal is 
to identify pockets of contamination spread throughout a large data set, we certainly do 
not want to weight an entire observation via a single divisor as in the classical model. In 
the other extreme, with a different divisor for each variable, the alternative model proves to 
be computationally burdensome. Moreover, the alternative model has pairwise correlations 



bounded at a level that is somewhat restrictive for small degrees of freedom v; see Finegold 



and Drton (2011). The approach we propose in this section interpolates between the two 
extremes and seems appealing in particular when there are batches of variables taking on 
extreme values, while the rest exhibit behavior consistent with a normal model. 

If groups of variables are similarly contaminated (or otherwise extreme), we can share 
statistical strength and ease our computational burden by forming clusters of Gamma divisors 
for each observation. We solve this clustering problem via a Dirichlet Process (DP) prior on 
the vector of r divisors for each observation. This Bayesian nonparametric approach avoids 
fixing a number of clusters and truly interpolates between the classical and alternative case. 

4.1 Background on Dirichlet Processes 



The Dirichlet Process is a measure on measures introduced by Ferguson (1973). Let Po be a 
probability measure on a measurable space (0,£>), and a > 0. We say that P is distributed 
according to a Dirichlet process with parameters a and Po if for any finite measurable partition 
{A\, . . . , A r ) of 0, the random vector (P(^4i), . . . , P(A r )) follows a Dirichlet distribution with 
parameters (aPo(Ai), . . . ,aPo(A r )). We write P ~ DP(a,Po). 

The Dirichlet process possesses a clustering property due to the fact that if P ~ DP(a, Pq) 
then P is discrete with probability 1 . This holds even if the base measure Pq is continuous ( |Fer 



guson, 1973). Let ttx, . . . ,ir n be independent draws from a random measure P ~ DP(a,P( 
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Figure 1 : Representation of the process generating a i^-random vector Y from a latent normal 
random vector X and an independent Dirichlet Gamma random vector r. The missing 
undirected edge between X\ and X3 indicates a graphical conditional independence. Directed 
arrows illustrate the functional relationship among X, t, and Y. 



Then the conditional distribution of 7r n given 7r\ n is a mixture, namely, 

a n-1 n_1 

— PoK) + — > ,MO> (4.1) 

N a + n — 1 a + n — 1 J 

where 5^. denotes a point mass at 7Tj. Hence, each new draw has a positive probability of 
assuming the same value as a previous draw, and this probability increases with each new 
draw. The choice of a greatly influences the number of expected clusters, with larger values 



leading to more clusters. New observations taking on the same values as existing ones in (4.1 ) 
gives an intuitive explanation to the phenomenon that Dirichlet processes often produce a 
small number of large clusters. This can be unsuitable for generic clustering applications but 
is, in fact, appealing for the robustification problem we consider. Here, we might often expect 
one large cluster that corresponds to uncontaminated (high-quality) observations. 



4.2 Dirichlet t-Model 

Applying Dirichlet processes in the t-distribution context yields the following construction, 
illustrated in Figure [l} 

Definition 1. Let Pq be the T(is/2, v/2) distribution and let P ~ DP (a, Pq). For j = 1, . . . ,p, 
let Tj ~ P be independent of each other given P. We then say that the random vector t GMP 
follows a Dirichlet Gamma distribution; in symbols r ~ DT p (a,v). If the random vectors 
X ~ A/^(0, \l/) and t ~ DT p (a, v) are independent, then we say that Y 6 W with coordinates 
Yj = fij + Xj/^yfj follows a Dirichlet t-distribution; in symbols Y ~ tp U (fi, \l/). 

The family of Dirichlet t-distribution includes the previous two models as extreme cases. 
When a — > 0, we will have one cluster, giving us the classical ^-distribution. When a — > 00 
we will have p clusters, giving us the alternative t-distribution. 
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Unlike in the alternative model, there is no upper bound on the correlations between two 
variables within a cluster. For any two variables Yj and Yk, the marginal covariance is 



1 



a i/r((i/-l)/2) 2 
a + lu-2 ' a + l 2T(v/2) 2 



+ 



(4.2) 



where l/(a + 1) is the probability that any two variables will be in the same r cluster. As 
a — > we obtain a maximum covariance of ipjk ■ vj(y — 2) and a maximum correlation of 1. 



In contrast to the mixture model of Antoniak (1974), which uses n draws from a Dirichlet 



process for n observations, we here draw p values for the coordinates of a single observation. 



The relevant conditional distributions are similar to those given in Escobar and West ( 1995 ) 



but include terms that reflect that dependence among the p coordinates of an observation. 

For Gibbs sampling we need the following full conditional (derived in the Appendix). For 
notational simplicity we consider Y and r to be p-vectors representing a single observation: 



(Tj I T\j, Y, 0) ~ q Pj(Tj) + ( lo l5 r J , (tj). 



(4.3) 



This mixture distribution involves point masses, denoted S Tk (rj) when supported at t^, and 
the distribution Pj(rj), which is the conditional distribution of Tj given (t\j,Y , 0) from 
the alternative £-model. The mixture weights for the point masses are denoted and are 
proportional to the conditional density of (Yj | 0, Y\j, ru, Tj = Tj/) evaluated at yj. This 
density is that of a normal distribution with mean fj, c / \ffy and variance cr^/r^ evaluated at 
yj —(J,j. We denote the density as fjv(Uj — fJ>j', /WvTf > a c/ T j')- Finally, the remaining mixture 
weight qo is proportional to a times the conditional density of (Yj \ 0, Y\j, T\ ) evaluated at 
yj, where Tj ~ Pq. This density is that of a noncentral ^-distribution with degrees of freedom 
v and noncentrality parameter (j, c /(t c , where \i c and a c are the conditional mean and standard 
deviation of (Xj \ Xu). We denote the density as fcivj — fJ>j] Hc/&c, v)/or c - 

Now define a vector z € W of cluster indicators by setting Zj = k if Tj belongs to the k th 
cluster. In our setting, all Tj in the k th cluster assume the same value, rjk, and thus r is a 



function of z and the vector of cluster values rj. Hence, we may rewrite (4.3) as 



K 



{Tj | T\j,Y, 0) ~ qoPj(Tj) + Qkd 



(4.4) 



k=i 



where K denotes the number of clusters, and q^ is proportional to 



with = \{f 7^ j : Zj> = k\ \ . Rewriting (4.4) using the conditional cluster assignments 
gives 



K 



( Z j | z\j,rj,Y,@) ~ qo5 Znew {zj) +^q k 5k{zj), 



(4.5) 



fc=i 
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where z„ 



K + 1 unless n 



U) 



in which case z n 



3 ■ 



The conditional in (4.5) describes the assignment of one node to a cluster given all the 



other cluster assignments and cluster values. We can also derive the conditional distribution 
of one cluster value given all the other cluster values and all the cluster assignments. Let 
(k) := {j : Zj = k] and n k = \ (k)\. The conditional density (derived in the Appendix) is 

f(Vk I »?\fc, Y, 0, z) = C(a, /3, 7) • r/° _1 exp {-rj k P ■ 



mi} 



(4.6) 



with 



a 



v + n k 



v + tr{& 



(k)(k) Y \k)Y \ k) 



7 = tr(@ 



(k)\(k)X\ {k) Y {k) ) 



The density being similar to (3.12), sampling can be performed using the same rejection 



sampling scheme. When the number of clusters is small relative to p, cycling through the 
clusters and drawing values for the whole cluster is much faster than cycling through all p 
nodes and assigning each to a cluster (and drawing values when new clusters are formed). 

In the algorithm we may repeat some steps more frequently than others. For instance, if we 
are able to quickly identify clusters representing significant deviations from normality, then we 
can perform the third step more frequently than the computationally expensive reclustering 



step. For initial values, we use the tlasso of Finegold and Drton (2011) to estimate r, which 
means we have one cluster for each observation. 



4.3 Inference for the Clustering Parameter a 

The Dirichlet Process parameter a plays a key role in determining the number of clusters, 



and it is beneficial to add another level of hierarchy and place a prior on a. Following Escobar 



and West (1995) who treat Dirichlet process mixture models, we consider a T(a,b) prior on 



a. In practice we choose a and b to give low prior mean to a, which leads to fewer clusters 
and easier computation, but allows for more clusters as the data requires. 

Let k denote the number of clusters. We have from Antoniak ( 1974 ) that, for k = 1, 



. n, 



P(k I a, n) = c n (k)n\a k - 



r(a) 



(4.7) 



T(a + n) ' 

1 , n) . Posterior inference is simplified by the fact that a is condi- 



where c n (k) = P{k \ a 
tionally independent of the observed data given the cluster assignments, leading to 

P(a I k,ir,X) = P(a \ k) oc P(a)P(k \ a). 



(4.8) 



That is, inference is based on the prior for a and a single observation from P(k \ a,n). 

Returning to our setting with an Ti-sample, let the vector k — (^i? • • • ? ^n) comprise 
the numbers of clusters in the n observations. Then a is conditionally independent of 
(t, Y, 0, fx, G) given k, and 



P{a I k, r, Y, 0, /x, G) oc P(a)P(k | a) = P(a) J| P(h \ a). 
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Let [3(a,f3) be the Beta function. Generalizing the results of Escobar and West (1995) to 
multiple observations ki, we use the fact that 

r(a) (a + n)/3(a + l,p) 
T(a + p) ctT(p) 

to obtain that P(a \ k) is proportional to 

1 1 
P(a)a^~ n {a + p) n J wf(l - w^dwi x • • • x J <(1 - Wnf^dwn. 



(4.9) 



(4.10) 



Hence, we may view P{a \ k) as a marginal distribution of P(a, wi, . . . , w n \ k) where 
< Wi < 1 are random variables that are conditionally independent of each other given a. 
Writing w = (w\, . . . , w n ), we consider the conditional distribution 



P{a \w,k)oc a a+ £"=i ^"^(a + p) n exp | 



a 



5> 



Expanding (a +p) n gives a mixture of n + 1 Gamma-distributions, namely, 



(a | w, k) 



n / n 

=0 ^ i=l 



E 

i 



n 

E 



log tD, 



with 



5> 



J 



i=l 7 x i=l 

The tfj being conditionally independent given a and fc, it holds that 

(wi | a, fc, W\i) ~ /3(a + l,p). 

Now suppose we observe n independent random vectors Y\, . . . 



(4.11) 

(4.12) 

(4.13) 
i p that follow a 



ip J ,(/i,\D f ) distribution. Once again, let Y and r be the associated matrices of observations 
and divisors. For small a, i.e., few expected clusters, we can create a sampler as follows. Let 
k{ be the number of clusters for the i th observation. The state space consists of the values for 
(G,&,z,rj) where z = {zij} is now annxp matrix and rj an array collecting n vectors of 
length k±, . . . , k n . Following Teh et al. (2006), we propose the following Gibbs sampler. 

Algorithm 4 (Dirichlet t). Starting with a decomposable graph Gq, and initial values /x , zq, 
ao, and rj , repeat the following steps for t = 0,1,2, ... : 

(i) Jointly draw a new graph Gt+i and a new matrix ®t+\ a s in Algorithm^ 

(ii) For each observation i = 1, . . . , n, cycle through the variables j = 1, . . . ,p and draw 
from the conditional given in (4-5). If Zij — z new assign to this new cluster a value f]i Znew by 
sampling from Pj(Tj) in (4-3). This results in a new matrix Zt+\- 

(Hi) For each observation i = 1, . . . , n, cycle through the clusters k = 1, . . . , Ki and draw ijn- 
using (4-6). This results in a new array rj t+1 . 

(iv) Assign n t+ \ as in Algorithm^ 

(v) For each observation i = 1, ... ,n, draw Wi from the conditional given in (4-13). 

(vi) Draw a from the conditional given in (4-H)- 
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5 Simulations 



5.1 AR1 with p=25 

To illustrate the behaviour of the different Bayesian methods, we first present simulations for 
graphs with 25 nodes, for which we run the Markov chain Monte Carlo samplers for 10, 000 



iterations per possible edge, as suggested in Jones et al. (2005). We choose a graph for an 



autoregressive process of order one, that is, the nodes form a chain and the corresponding pre- 
cision matrix is tri-diagonal. We forego simulating random draws of for a clear distinction 
between true positives and negatives. Instead, we set the non-zero off-diagonal elements of 
to —1 and the diagonal elements to 3 (except the first and last, which are set to 2). Unless 
otherwise noted, we fix the degrees of freedom v = 3, the graph prior parameter d = 0.05, 



recall (2.3), and the hyperparameter 6 = 1. It is not desirable, however, to have exactly the 
same priors in the normal and t-models. With the most extreme observations downweighted, 
the entries of the matrix S t yy from the i-model tend to be smaller in magnitude than those 
of the sample covariance matrix S, and, in our experience, the same hyperparameter matrix 
<1? then leads to larger graphs for the t-case. To get graphs of comparable size, we choose 
<1? = Xp/5 for the normal model and = X p /10 for t- models. 

We first simulate n independent normal observations from Af p (0, _1 ). In order to illus- 
trate the effects on inference of dependence structures, we assume the mean to be known and 
run five different estimation methods: 

(a) The normal procedure (Q- 

(b) The normal procedure using the maximum likelihood estimate S+yy from the classical 
t Pi 3-model as the sufficient statistic instead of S. 



(c) The classical t p ^ procedure (£3.2), drawing the matrix r once every 10 edge proposals. 



(d) The Dirichlet t" 3 procedure Q4.2) with a T(l,l) prior on a. We draw new cluster 



identifiers z for every 20 draws of r/, which in turn is drawn once every 10 edge proposals. 



(e) The alternative t* 3 procedure ({ 3.3 ), drawing the matrix r once every 10 edge proposals. 



Assuming known means only favors normal procedures for which estimation of the mean 
from heavy-tailed data is problematic in itself. For each method, we perform 3 million edge 
proposals. We repeat the process 50 times, each time recording the posterior probability 
P{e-jk = 1 | Y), that is, the probability of edge {j, k] to be in the true edge set. If P(ejk = 
1 | Y) > e for some threshold e, we consider it a "positive". We let e range from to 1 and 
record the number of true and false positives in all 50 simulations. We then compare the true 
positive rate to the false positive rate at each threshold. This entire process is repeated for 
data generated from a t p ^{0, _1 ) and a ^3(0, _1 ) distribution. 

The simulation results are summarized in Figure [2j which shows that the more flexible 
models indeed perform better when the data is generated from the more complicated model. 
With normal data, the i-models all perform similarly well and the normal model outperforms 
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Normal n-25, p-25. AR1 



Classical n-25, p-25, AR1 
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Figure 2: ROC curves depicting the performances of the five methods for data generated from 
a A/25(0, _1 ) distribution, a £25,3(0, -1 ) distribution and a ^5 3(0, 0" 1 ) distribution. 
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them only slightly. For classical i-data, the classical t-model performs significantly better 
than the normal model. The alternative model is clearly inferior to the classical model. The 
performance of the Dirichlet i-model, on the other hand, is barely distinguishable from that of 
the classical t-model. With a T(l, 1) prior for a, the Dirichlet method finds an average of 1.4 
clusters per r vector, rendering it very similar to the classical model. The normal procedure 
with the robustified estimate S+yy performs better than the purely normal technique, but 
not as well as the classical t-model. In addition to the superior performance seen here, the 
fully Bayesian approach has, of course, the added benefit of a posterior distribution on r. 
This can be useful to assess of each observation's consistency with normality. 

For alternative i-data, the alternative i-method clearly outperforms its normal and clas- 
sical t-analogues, which perform equally poorly. Only the Dirichlet t-model, which finds an 
average of 7.7 clusters per r vector, performs comparably to the alternative model. Its strong 
performance on both classical and alternative data suggests that the Dirichlet method is 
indeed an effective compromise between the two other i-distribution techniques. 

For classical t-data, fitting the classical, the Dirichlet and the alternative f-model took on 
average 2.3N, 3.6N and 4.5iV, respectively, where N is the processing time for the normal 



model. Based on use of 'R' (R Development Core Team, 2010), the times are meant only 



to be rough estimates of actual computational complexity. Nevertheless, the comparison 
suggests that the Dirichlet approach adaptively produces statistically efficient estimates while 
using a run-time about halfway between that of the classical and alternative procedures. 
For alternative t-data, the Dirichlet model faces the added complexity of reclustering steps 
without much benefit from any clustering. Indeed, the average run time was 5N for the 
Dirichlet model compared to 3.8N for the alternative model. This said, we would not expect 
any real application requiring as large a number of clusters as simulated alternative i-data. 

5.2 Random Graphs with p=100 

For a more challenging scenario, we consider p = 100 nodes and create the graph by forming 
20 random cliques of size 2 to 5. For each clique, we pick nodes at random and form edges 
between all nodes in the clique. We draw the mean vector /x as a p-vector of independent 
standard normals. We set the non-zero entries in O as before (but multiply the diagonal 
entries by a constant to ensure a minimum eigenvalue of at least 0.6). We then simulate 
n = 100 independent observations from a Ap(0, _1 ) distribution to create latent data X. 

Next, we contaminate the data via an n x p matrix r that holds divisors for X, with 
the goal of creating contamination in many parts of many observations so that detecting 
outliers manually would be difficult. For each one of a total of 10 contaminations, we draw a 
Poisson number of rows and a Poisson number of columns (mean 10 for both). We then select 
uniformly at random a submatrix of r that has this given size and assign a single random 
value (uniform[0.01,0.2]) to the entries in the submatrix. The remaining entries of r are set 
to 1 . The observations Y are created by setting Y{j = fij + Xij / Tij . This results on average 
in contamination in slightly less than one in ten elements of the latent data matrix. 



We run the five algorithms from Section 5.1 under the settings described there, but with 
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Contaminated Normal n=100 p=100 Random Graph 




False Positive Rate 



Figure 3: ROC curves depicting the performances of the five methods for data generated from 
a contaminated A/iqo(0, distribution that is Markov to a random graph. 

<1? = Xp/5 for the normal model and 3> = X p /20 for t-models, to get graphs of comparable 
size. We run the samplers to obtain 2 million edge draws and for the t models we sample t 
every 30 edge draws. The results for 25 repeats of the entire process are shown in Figure [3j 
which makes a clear case for the Dirichlet t-model. 

The 2 million draws do not seem enough for "convergence" of all Markov chains. Figure |4j 
which shows the number of edges in the estimated graph over the iterations of the samplers, 
suggests that the alternative model may require much longer runs. As another plus, the 
Dirichlet model seems to require far fewer iterations. 

6 Gene Expression Data 

Gasch et al.| (|2000|) present data from microarray experiments with yeast strands. As in the 



17 



Normal Normal (tlasso) Classical 
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Dlrichlet Alternative 




Oe+00 1e+05 2e+05 3E+05 4e+05 5e+0S 500000 1000000 1500000 2000000 

Figure 4: Edges in estimated graph plotted against iterations of the Gibbs sampler for models 
used in the contaminated normal simulations. 
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Posterior Probability of Tau Being in the Lowest 5% Quantile 



Figure 5: Lighter colors represent higher posterior probability that Tij is in the bottom 5% 
quantile of the prior distribution on r. The light blocks on the bottom right correspond to 
the "outliers" in the yeast data from Section [6j 



frequentist work in Finegold and Drton (2011), we focus on 8 genes involved in galactose 
utilization; 136 experiments have data for all the 8 genes. In 11 experiments, 4 of the genes 
have abnormally large negative expression values. Figure [5j which is obtained from 3 million 
iterations of our sampler, shows that the Dirichlet if g 1 model leads to downweighting of the 
abnormal values but not entire observations, as desired. 

To demonstrate automatic detection of outliers in a much larger dataset, we add the data 
for 92 genes, selected at random from those without missing data. We fit the models t^ 10Q for 
fixed a = {1, 10, 100}. We run the samplers for 2 million edge draws and follow a proposal 
of Donnet and Marin (2010) to select candidate edges with probability proportional to the 
sample correlation of the corresponding variables. Figure [6] provides convergence diagnostics. 

For a = 100 the propensity to cluster is very weak with an average of 67 different clusters 
in each row of r in the final iteration. Nonetheless, the average r value for the 4 relevant 
genes in the 11 relevant experiments was 0.046 compared with an average value of 0.84 for the 
rest of the matrix. That is, as we let the Dirichlet t model approach the alternative model, 
we achieve downweights of the suspected outliers even without the benefits of clustering. 

With a = 10 the propensity to cluster is a bit stronger; the rows of r have an average of 
20 different clusters in the final iteration. We achieve relative downweighting similar to the 
a = 100 model (0.067 to 0.47). Despite 20 clusters per row, at least two of the outliers always 
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Classical 



Di rich let atpha=1 




Figure 6: Number of edges in estimated graph plotted against iterations of the Gibbs sampler 
for models used in analysis of gene expression data. 

share a cluster and in five of the experiments, at least three share a cluster. 

Letting a. = 1, we achieve much greater clustering (2.8 per row). The suspected outliers 
tend to cluster together, but there is little practical difference between this model and the 
classical t. The classical t downweights most of the observations significantly and takes much 
longer to reach the appearance of equilibrium in the estimated edges; see Figure [6j 



7 Discussion 



We have extended Bayesian approaches to graphical Gaussian modeling using three variations 
of multivariate i-distributions. While these extensions all come at increased computational 
expense, they can have substantial statistical benefit. In particular, one obtains a posterior 
distribution for latent weights that measures uncertainty about potential outliers. 

Our extensions to t-distributions are based on one particular Gaussian model, but many 
other variations have been treated in the literature. Some authors place a prior on the degrees 
of freedom parameter 5 for the Hyper Inverse Wishart (HIW) prior distribution. In addition, 
one can introduce a prior on the edge density parameter d from (2.3). An alternative approach 
is to place a uniform prior on the size of the graph, and then a uniform prior on all graphs 
of the same size. We have chosen the scale matrix of the HIW prior to be $ = cl p and 
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treated fixed choices of c in our simulations. One could instead include a prior on c, or set 
<I> = cA, where A is the sample covariance matrix, or an equicorrelated matrix with diagonal 



elements equal to 1 and off-diagonal elements equal to a common value p. Armstrong et al. 



(2009) consider all of these variations, including placing a prior on p. Finally, as mentioned 
in the introduction, there now exist more flexible versions of the HIW distribution as well 
as techniques for approximate computations for non-decomposable graphs. Incorporating 
the former generalization in the t-distribution context would be straightforward; addressing 
non-decomposable graphs, however, constitutes an interesting area for further research. 

As noted previously by many authors, for large p even the most likely graphs may have 
very small posterior probability, making it difficult and not necessarily very informative to 
identify the graph with highest posterior probability. In practice, the focus may thus often 
be on more modest goals, such as the posterior distribution of subgraphs on some subset of 
vertices, or even more simply, the marginal posterior probability that each edge is in the edge 
set E that we considered in the simulation study. 

In the data from Section [6j a group of genes had extreme expression values in several 
observations. While the Dirichlet t-model did a good job identifying these clusters, it could 
potentially be worthwhile to share statistical strength by explicitly modeling the clustering 
of latent weights as similar across observations. This could be done by treating the p-vectors 
Ti,...,r n as draws from a Dirichlet process mixture model. Combined with the Dirichlet 
model, this would give a 'doubly Dirichlet' r matrix. That is, we will have k < p distinct 
elements within each row (observation) and I < n distinct rows. Inference in this model would 
be more involved than in the ordinary Dirichlet t-model we discussed, but the full conditionals 
necessary to devise a Gibbs sampler would be of similar type. 
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A Appendix 



A.l Alternative Procedures for Drawing the Mean Vector n 



The parameters of the normal distribution in (3.10) involve the inverse of the sum of matrices 
Y17=l r «® + ®M' whe-re &/i = I p /a a is a multiple of the identity matrix. With rare exceptions, 
the hyperparameter a u , a variance, is chosen large so as to make the prior distribution on fj, 
flat. In those cases, 0^ is small compared to the typical values of Y27=i r «®- Hence, little 
is lost by ignoring the term 0^ in the matrix inversion, which leads to the distributional 
approximation 



(H | Y,G,r,@)^M p 



En 
i=l T i 



V-V71 







(A.l) 



This approximation still requires the completion step we have avoided to obtain 1 = 
While this is not prohibitively expensive, we find that simply setting fi equal to the mean 
of the distribution in (A.l) works well enough in practice - we call this "Robust Centering". 



To test this, we simulated classical t data from a chain graph with 25 nodes as described in 
Section [5} We ran four versions of the the Dirichlet t algorithm (Algorithm 4) starting with 
the same seed: using naive centering (subtract of the sample mean for each variable and set 
/j, = 0); Robust Centering; sampling from the approximate conditional in equation (A.l); and 



sampling from the exact conditional in (3.10). With a a set to 100,000 we find virtually no 



difference between the estimated values of fi from the last three procedures, but significant 
difference between those three and the first. The first procedure does a worse job of estimating 
/x and, as a result, 0. We conclude that Robust Centering is better than naive centering, but 
that virtually nothing is lost by failing to sample from the approximate or full conditionals. 



A.2 Full Conditional for Latent Divisors in the Dirichlet t-Model 

For notational convenience, let Zj = if tj belongs to a new cluster and consider the case 
j = p. Let K be the number of distinct clusters containing elements other than r p . We may 
then write 

K 

P{t p < t | T\ p , Y,&) = J2 p ( T P < t I tsj,, Y, z p = k)P{z p = k | T\ p , Y, 0). 

fc=0 

The conditional density of t p given (t\ p ,Y,@, z = k) is trivially the point mass at the 
value assumed by all elements of r that belong to the k th cluster. The conditional density of 
t p given (T\p, Y, 0, z = 0) is 

/(Tp | T\p, Y, 0, Z = 0) CC /(Tp | T\ p , 0, Z = 0)f(Y I T, ®,Z = 0) 

= f(T p \z = 0)f(Y\T,@) 

oc f v (T p ;v/2,v/2)f N (Y p ;Hc/^,ol/T p ), (A.2) 
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where fr(T p ; v/2, v/2) is the density of a T(y/2 : v/2) distribution evaluated at r p . The dis- 
tribution specified by ( |A.2 ) is the Gibbs sampling distribution from the alternative model in 



(3.12). Now, 



P(z p = | rvp, Y, 0) oc P{z p = 0,Y p \ Y\ p , T \ p , 



P(z p = | r\ p , Y\ p , &)P(Y p | X\ p , 0, z p = 0) 



oc aP{Y p | X\ p , 0, . 



0). 



If Zj = then Xj and Tj are conditionally independent of (Y\j,T\j) and each other given 
X\j. Let fj, c and a c be the conditional mean and standard deviation of (Xj \ X\j). Therefore, 
given X\j, the random variable Yj/a c has the same distribution as 



for Z ~ A/"(0, 1). We recognize the distribution as a noncentral i-distribution with degrees of 
freedom v and noncentrality parameter [i c /a c . 
Similarly, for k > 0, 

P{z p = k | t\ p ,Y,@) oc P(Y P | T\ p ,y\ p ,0,zp = k). 

Now, if z p = k, then Yp = X c j ^Jt\ where X c ~ M(fi c ,a c ). Therefore, the conditional 



distribution of Y p given 0, Y\^, T\ p , and Zp — k IS 



Combining all the above elements gives the result stated in (4.5) 



A. 3 Full Conditional for Cluster Values in the Dirichlet t-Model 

Define (k) = {j : Zj = k} and \(k) = {1, . . . ,p} \ (k). First, note that the pair (Y( k yr] k ) is 
conditionally independent of Y\r k \ given X\( k y Hence, 

f(Vk I V\k,Y,@,z) = f{rj k | Y {k) ,X\ {k) ,z) oc f(rj k ,Y {k) \ X\ {k) ,z). 

The last density is equal to 

(A.3) 

s/Vk Vk J 

where /x c and 5] c are the conditional mean vector and covariance matrix of X^ given X\/ k y 
By the well-known formulas for inverse of a partitioned matrix, @( k y k ) is the inverse of S c , 



f(Vk I x \(k))f ( Y {k) I X\ {k) ,r] k ,z) = f T (r) k ;v/2,v/2) ■ fa \Y {k y, 



and /x c is equal to S^y^S^^yX^yThe product in (A.3) is thus proportional to 



rfj 2 1 exp{-7? fc i//2}|7? fc (fc){fc) | 1/2 expi -- 



(A) 



Vk®(k)(k) Y (k) 



(u+\{k)\)/2- 



1 exp |-? ?fc u/2 + ^(©(^(^Y^yJ.))^ - V%^(0W\(i) x \(i) y [fc))} . 



which is the claim of (4.6). Note that when (k) is a singleton, we get the conditional distri- 



bution for the alternative model. 
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